@martin-steinegger
Member

No description provided.

@yanyanshao

Hello @martin-steinegger

This is a very impressive and crucial enhancement for large-scale clustering.

We are currently facing a project that requires clustering ~11 billion (1.1e10) protein sequences.

Could you please advise if there is a version of MMseqs2 (like a branch from this PR) that is already capable of handling a dataset of this scale?

If a single run is not yet feasible, what would be the recommended strategy? For example, is the "split-cluster-merge" approach the best practice? Have you conducted any scalability tests or benchmarks for clustering at this unprecedented scale (e.g., tens of billions of sequences)?

Any guidance or insights from you would be immensely helpful for our work. Thank you for developing and continuously improving this fantastic tool!

@milot-mirdita
Member

This PR is still very much in development and is not production-ready. Our current recommendation is still to split the database into chunks of 2-3 billion sequences and cluster each chunk separately. Afterwards, merge the chunks until you reach 2-3 billion sequences again, cluster again, and repeat until everything is done.

We are of course interested in getting native support for this into MMseqs2, but this might still take a while.
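
To get a feel for how many rounds such a split-and-merge run might need, here is a minimal planning sketch. It is only one reading of the strategy above, not an official MMseqs2 workflow: it assumes that each round clusters chunks of at most `chunk_size` sequences and that a fixed fraction `reduction` of the sequences is carried into the next round. Both `plan_rounds` and `reduction` are hypothetical names introduced for illustration, and the real reduction depends on the data and the clustering thresholds. The actual MMseqs2 commands are not given in this thread and are not shown here.

```python
import math


def plan_rounds(n_sequences, chunk_size, reduction):
    """Estimate the rounds of a split-cluster-merge run.

    n_sequences: number of input sequences
    chunk_size:  maximum number of sequences clustered in a single run
                 (the thread suggests 2-3 billion)
    reduction:   ASSUMED fraction of sequences carried into the next round
                 (hypothetical; it is not a value stated in the thread)
    """
    if n_sequences < 1 or chunk_size < 1:
        raise ValueError("n_sequences and chunk_size must be positive")
    if not 0 < reduction < 1:
        raise ValueError("reduction must be strictly between 0 and 1")

    rounds = []
    remaining = n_sequences
    while True:
        # Split whatever is left into chunks that fit in a single run.
        n_chunks = math.ceil(remaining / chunk_size)
        rounds.append({"input": remaining, "chunks": n_chunks})
        if n_chunks == 1:
            # Everything fits in one run, so this is the final clustering.
            break
        # Merge the per-chunk results; assume they shrink by `reduction`.
        remaining = max(1, int(remaining * reduction))
    return rounds


# Example: ~11 billion sequences, 2.5 billion per chunk, assumed 50% reduction.
for number, info in enumerate(plan_rounds(11_000_000_000, 2_500_000_000, 0.5), start=1):
    print(f"round {number}: {info['input']:,} sequences in {info['chunks']} chunk(s)")
```

The loop stops as soon as the remaining sequences fit into a single chunk, which mirrors "cluster until everything is done". A slower assumed reduction simply adds more rounds.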

@milot-mirdita
Member

We have clustered ~100B sequences with the split-and-merge strategy before.

@yanyanshao

@milot-mirdita
Thanks a lot for the detailed recommendation! I’ve got a follow-up question: after splitting the database into chunks and clustering each one individually, could you share the specific steps for merging these separate clustering results (before we re-cluster the combined set once it hits 2-3 billion sequences again)? I’d really appreciate some concrete guidance here.

@yanyanshao

@milot-mirdita, could you share a step-by-step example (including commands) of this split-cluster-merge workflow in MMseqs2?
